{
"cells": [
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# Training and prediction with scikit-learn\n",
"\n",
"This notebook demonstrates how to use AI Platform to train a simple classification model using `scikit-learn`, and then deploy the model to get predictions.\n",
"\n",
"You train the model to predict a person's income level based on the [Census Income data set](https://archive.ics.uci.edu/ml/datasets/Census+Income).\n",
"\n",
"Before you jump in, let’s cover some of the different tools you’ll be using:\n",
"\n",
"+ [AI Platform](https://cloud.google.com/ml-engine/) is a managed service that enables you to easily build machine learning models that work on any type of data, of any size.\n",
"\n",
"+ [Cloud Storage](https://cloud.google.com/storage/) is a unified object storage for developers and enterprises, from live data serving to data analytics/ML to data archiving.\n",
"\n",
"+ [Cloud SDK](https://cloud.google.com/sdk/) is a command line tool which allows you to interact with Google Cloud products. This notebook introduces several `gcloud` and `gsutil` commands, which are part of the Cloud SDK. Note that shell commands in a notebook must be prepended with a `!`."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# Set up your environment\n",
"\n",
"### Enable the required APIs\n",
"\n",
"In order to use AI Platform, confirm that the required APIs are enabled:"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"!gcloud services enable ml.googleapis.com\n",
"!gcloud services enable compute.googleapis.com"
]
},
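{
"cell_type": "markdown",
"metadata": {},
"source": [
"Optionally, confirm that both APIs now appear in the list of enabled services. This quick check is a sketch that simply greps the `gcloud services list` output; no output means the API is not yet enabled:"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# List the enabled services and keep only the two APIs this tutorial uses\n",
"!gcloud services list --enabled | grep -E 'ml.googleapis.com|compute.googleapis.com'"
]
},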
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Create a storage bucket\n",
"Buckets are the basic containers that hold your data. Everything that you store in Cloud Storage must be contained in a bucket. You can use buckets to organize your data and control access to your data.\n",
"\n",
"Start by defining a globally unique name.\n",
"\n",
"For more information about naming buckets, see [Bucket name requirements](https://cloud.google.com/storage/docs/naming#requirements)."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"BUCKET_NAME = \"your-new-bucket\""
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"In the examples below, the `BUCKET_NAME` variable is referenced in the commands using `$`.\n",
"\n",
"Create the new bucket with the `gsutil mb` command:"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"!gsutil mb gs://$BUCKET_NAME/"
]
},
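{
"cell_type": "markdown",
"metadata": {},
"source": [
"Optionally, verify that the bucket was created. The `gsutil ls -b` command prints the bucket URL if the bucket exists and you have access to it:"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Print the bucket URL to confirm that the bucket exists\n",
"!gsutil ls -b gs://$BUCKET_NAME/"
]
},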
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### About the data\n",
"\n",
"The [Census Income Data Set](https://archive.ics.uci.edu/ml/datasets/Census+Income) that this sample\n",
"uses for training is provided by the [UC Irvine Machine Learning\n",
"Repository](https://archive.ics.uci.edu/ml/datasets/).\n",
"\n",
"Census data courtesy of: Lichman, M. (2013). UCI Machine Learning Repository http://archive.ics.uci.edu/ml. Irvine, CA: University of California, School of Information and Computer Science. This dataset is publicly available for anyone to use under the following terms provided by the Dataset Source - http://archive.ics.uci.edu/ml - and is provided \"AS IS\" without any warranty, express or implied, from Google. Google disclaims all liability for any damages, direct or indirect, resulting from the use of the dataset.\n",
"\n",
"The data used in this tutorial is located in a public Cloud Storage bucket:\n",
"\n",
" gs://cloud-samples-data/ml-engine/sklearn/census_data/ \n",
"\n",
"The training file is `adult.data` ([download](https://storage.googleapis.com/cloud-samples-data/ml-engine/sklearn/census_data/adult.data)) and the evaluation file is `adult.test` ([download](https://storage.googleapis.com/cloud-samples-data/ml-engine/sklearn/census_data/adult.test)). The evaluation file is not used in this tutorial."
]
},
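{
"cell_type": "markdown",
"metadata": {},
"source": [
"To get a feel for the data before training, you can preview a few rows locally. This is an optional sketch that assumes `pandas` is available in the notebook environment; note that the raw CSV has no header row and that string values carry a leading space:"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"import pandas as pd\n",
"\n",
"# Read the first few rows of the public training file over HTTPS.\n",
"# The file has no header row, so pass header=None.\n",
"preview = pd.read_csv(\n",
"    \"https://storage.googleapis.com/cloud-samples-data/ml-engine/sklearn/census_data/adult.data\",\n",
"    header=None,\n",
"    nrows=5,\n",
")\n",
"preview"
]
},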
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# Create training application package\n",
"\n",
"The easiest (and recommended) way to create a training application package is to use `gcloud` to package and upload the application when you submit your training job. This method allows you to create a very simple file structure with only two files. For this tutorial, the file structure of your training application package should appear similar to the following:\n",
"\n",
" census_training/\n",
" __init__.py\n",
" train.py"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Create a directory locally:"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"!mkdir census_training"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Create a blank file named `__init__.py`:"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"!touch ./census_training/__init__.py"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Save training code in one Python file in the `census_training` directory. The following cell writes a training file to the `census_training` directory. The training file performs the following operations:\n",
"+ Loads the data into a pandas `DataFrame` that can be used by `scikit-learn`\n",
"+ Fits the model is against the training data\n",
"+ Exports the model with the [Python `pickle` library](https://docs.python.org/3/library/pickle.html)\n",
"\n",
"The following model training code is not executed within this notebook. Instead, it is saved to a Python file and packaged as a Python module that runs on AI Platform after you submit the training job."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"%%writefile ./census_training/train.py\n",
"import argparse\n",
"import pickle\n",
"import pandas as pd\n",
"\n",
"from google.cloud import storage\n",
"\n",
"from sklearn.ensemble import RandomForestClassifier\n",
"from sklearn.feature_selection import SelectKBest\n",
"from sklearn.pipeline import FeatureUnion\n",
"from sklearn.pipeline import Pipeline\n",
"from sklearn.preprocessing import LabelBinarizer\n",
"\n",
"parser = argparse.ArgumentParser()\n",
"parser.add_argument(\"--bucket-name\", help=\"The bucket name\", required=True)\n",
"\n",
"arguments, unknown = parser.parse_known_args()\n",
"bucket_name = arguments.bucket_name\n",
"\n",
"# Define the format of your input data, including unused columns.\n",
"# These are the columns from the census data files.\n",
"COLUMNS = (\n",
" 'age',\n",
" 'workclass',\n",
" 'fnlwgt',\n",
" 'education',\n",
" 'education-num',\n",
" 'marital-status',\n",
" 'occupation',\n",
" 'relationship',\n",
" 'race',\n",
" 'sex',\n",
" 'capital-gain',\n",
" 'capital-loss',\n",
" 'hours-per-week',\n",
" 'native-country',\n",
" 'income-level'\n",
")\n",
"\n",
"# Categorical columns are columns that need to be turned into a numerical value\n",
"# to be used by scikit-learn\n",
"CATEGORICAL_COLUMNS = (\n",
" 'workclass',\n",
" 'education',\n",
" 'marital-status',\n",
" 'occupation',\n",
" 'relationship',\n",
" 'race',\n",
" 'sex',\n",
" 'native-country'\n",
")\n",
"\n",
"# Create a Cloud Storage client to download the census data\n",
"storage_client = storage.Client()\n",
"\n",
"# Download the data\n",
"public_bucket = storage_client.bucket('cloud-samples-data')\n",
"blob = public_bucket.blob('ml-engine/sklearn/census_data/adult.data')\n",
"blob.download_to_filename('adult.data')\n",
"\n",
"# Load the training census dataset\n",
"with open(\"./adult.data\", \"r\") as train_data:\n",
" raw_training_data = pd.read_csv(train_data, header=None, names=COLUMNS)\n",
" # Removing the whitespaces in categorical features\n",
" for col in CATEGORICAL_COLUMNS:\n",
" raw_training_data[col] = raw_training_data[col].apply(lambda x: str(x).strip())\n",
"\n",
"# Remove the column we are trying to predict ('income-level') from our features\n",
"# list and convert the DataFrame to a lists of lists\n",
"train_features = raw_training_data.drop(\"income-level\", axis=1).values.tolist()\n",
"# Create our training labels list, convert the DataFrame to a lists of lists\n",
"train_labels = (raw_training_data[\"income-level\"] == \" >50K\").values.tolist()\n",
"\n",
"# Since the census data set has categorical features, we need to convert\n",
"# them to numerical values. We'll use a list of pipelines to convert each\n",
"# categorical column and then use FeatureUnion to combine them before calling\n",
"# the RandomForestClassifier.\n",
"categorical_pipelines = []\n",
"\n",
"# Each categorical column needs to be extracted individually and converted to a\n",
"# numerical value. To do this, each categorical column will use a pipeline that\n",
"# extracts one feature column via SelectKBest(k=1) and a LabelBinarizer() to\n",
"# convert the categorical value to a numerical one. A scores array (created\n",
"# below) will select and extract the feature column. The scores array is\n",
"# created by iterating over the columns and checking if it is a\n",
"# categorical column.\n",
"for i, col in enumerate(COLUMNS[:-1]):\n",
" if col in CATEGORICAL_COLUMNS:\n",
" # Create a scores array to get the individual categorical column.\n",
" # Example:\n",
" # data = [\n",
" # 39, 'State-gov', 77516, 'Bachelors', 13, 'Never-married',\n",
" # 'Adm-clerical', 'Not-in-family', 'White', 'Male', 2174, 0,\n",
" # 40, 'United-States'\n",
" # ]\n",
" # scores = [0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]\n",
" #\n",
" # Returns: [['State-gov']]\n",
" # Build the scores array\n",
" scores = [0] * len(COLUMNS[:-1])\n",
" # This column is the categorical column we want to extract.\n",
" scores[i] = 1\n",
" skb = SelectKBest(k=1)\n",
" skb.scores_ = scores\n",
" # Convert the categorical column to a numerical value\n",
" lbn = LabelBinarizer()\n",
" r = skb.transform(train_features)\n",
" lbn.fit(r)\n",
" # Create the pipeline to extract the categorical feature\n",
" categorical_pipelines.append(\n",
" (\n",
" 'categorical-{}'.format(i), \n",
" Pipeline([\n",
" ('SKB-{}'.format(i), skb),\n",
" ('LBN-{}'.format(i), lbn)])\n",
" )\n",
" )\n",
"\n",
"# Create pipeline to extract the numerical features\n",
"skb = SelectKBest(k=6)\n",
"# From COLUMNS use the features that are numerical\n",
"skb.scores_ = [1, 0, 1, 0, 1, 0, 0, 0, 0, 0, 1, 1, 1, 0]\n",
"categorical_pipelines.append((\"numerical\", skb))\n",
"\n",
"# Combine all the features using FeatureUnion\n",
"preprocess = FeatureUnion(categorical_pipelines)\n",
"\n",
"# Create the classifier\n",
"classifier = RandomForestClassifier()\n",
"\n",
"# Transform the features and fit them to the classifier\n",
"classifier.fit(preprocess.transform(train_features), train_labels)\n",
"\n",
"# Create the overall model as a single pipeline\n",
"pipeline = Pipeline([(\"union\", preprocess), (\"classifier\", classifier)])\n",
"\n",
"# Create the model file\n",
"# It is required to name the model file \"model.pkl\" if you are using pickle\n",
"model_filename = \"model.pkl\"\n",
"with open(model_filename, \"wb\") as model_file:\n",
" pickle.dump(pipeline, model_file)\n",
"\n",
"# Upload the model to Cloud Storage\n",
"bucket = storage_client.bucket(bucket_name)\n",
"blob = bucket.blob(model_filename)\n",
"blob.upload_from_filename(model_filename)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# Submit the training job\n",
"\n",
"In this section, you use [`gcloud ai-platform jobs submit training`](https://cloud.google.com/sdk/gcloud/reference/ai-platform/jobs/submit/training) to submit your training job. The `--` argument passed to the command is a separator; anything after the separator will be passed to the Python code as input arguments.\n",
"\n",
"For more information about the arguments preceeding the separator, run the following:\n",
"\n",
" gcloud ai-platform jobs submit training --help\n",
"\n",
"The argument given to the python script is `--bucket-name`. The `--bucket-name` argument is used to specify the name of the bucket to save the model file."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"scrolled": true
},
"outputs": [],
"source": [
"import time\n",
"\n",
"# Define a timestamped job name\n",
"JOB_NAME = \"census_training_{}\".format(int(time.time()))"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"scrolled": true
},
"outputs": [],
"source": [
"# Submit the training job:\n",
"!gcloud ai-platform jobs submit training $JOB_NAME \\\n",
" --job-dir gs://$BUCKET_NAME/census_job_dir \\\n",
" --package-path ./census_training \\\n",
" --module-name census_training.train \\\n",
" --region us-central1 \\\n",
" --runtime-version=1.12 \\\n",
" --python-version=3.5 \\\n",
" --scale-tier BASIC \\\n",
" --stream-logs \\\n",
" -- \\\n",
" --bucket-name $BUCKET_NAME"
]
},
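{
"cell_type": "markdown",
"metadata": {},
"source": [
"Because the job was submitted with `--stream-logs`, the previous command blocks until the job finishes. Optionally, confirm the job's final state (a successful run reports `SUCCEEDED`):"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Print only the job's state field\n",
"!gcloud ai-platform jobs describe $JOB_NAME --format=\"value(state)\""
]
},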
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Verify model file in Cloud Storage\n",
"\n",
"View the contents of the destination model directory to verify that your model file has been uploaded to Cloud Storage.\n",
"\n",
"Note: The model can take a few minutes to train and show up in Cloud Storage."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"!gsutil ls gs://$BUCKET_NAME/"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# Serve the model"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Once the model is successfully created and trained, you can serve it. A model can have different versions. In order to serve the model, create a model and version in AI Platform.\n",
"\n",
"Define the model and version names:"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"MODEL_NAME = \"CensusPredictor\"\n",
"VERSION_NAME = \"census_predictor_{}\".format(int(time.time()))"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Create the model in AI Platform:"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"!gcloud ai-platform models create $MODEL_NAME --regions us-central1"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Create a version that points to your model file in Cloud Storage:"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"!gcloud ai-platform versions create $VERSION_NAME \\\n",
" --model=$MODEL_NAME \\\n",
" --framework=scikit-learn \\\n",
" --origin=gs://$BUCKET_NAME/ \\\n",
" --python-version=3.5 \\\n",
" --runtime-version=1.12"
]
},
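{
"cell_type": "markdown",
"metadata": {},
"source": [
"Optionally, verify that the version was created and is ready to serve predictions (a ready version reports state `READY`):"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Describe the new version, including its state and framework\n",
"!gcloud ai-platform versions describe $VERSION_NAME --model=$MODEL_NAME"
]
},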
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# Make predictions\n",
"\n",
"### Format data for prediction\n",
"\n",
"Before you send an online prediction request, you must format your test data to prepare it for use by the AI Platform prediction service. Make sure that the format of your input instances matches what your model expects.\n",
"\n",
"Create an `input.json` file with each input instance on a separate line. The following example uses ten data instances. Note that the format of input instances needs to match what your model expects. In this example, the Census model requires 14 features, so your input must be a matrix of shape (`num_instances, 14`)."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Define a name for the input file\n",
"INPUT_FILE = \"./census_training/input.json\""
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"scrolled": true
},
"outputs": [],
"source": [
"%%writefile $INPUT_FILE\n",
"[25, \"Private\", 226802, \"11th\", 7, \"Never-married\", \"Machine-op-inspct\", \"Own-child\", \"Black\", \"Male\", 0, 0, 40, \"United-States\"]\n",
"[38, \"Private\", 89814, \"HS-grad\", 9, \"Married-civ-spouse\", \"Farming-fishing\", \"Husband\", \"White\", \"Male\", 0, 0, 50, \"United-States\"]\n",
"[28, \"Local-gov\", 336951, \"Assoc-acdm\", 12, \"Married-civ-spouse\", \"Protective-serv\", \"Husband\", \"White\", \"Male\", 0, 0, 40, \"United-States\"]\n",
"[44, \"Private\", 160323, \"Some-college\", 10, \"Married-civ-spouse\", \"Machine-op-inspct\", \"Husband\", \"Black\", \"Male\", 7688, 0, 40, \"United-States\"]\n",
"[18, \"?\", 103497, \"Some-college\", 10, \"Never-married\", \"?\", \"Own-child\", \"White\", \"Female\", 0, 0, 30, \"United-States\"]\n",
"[34, \"Private\", 198693, \"10th\", 6, \"Never-married\", \"Other-service\", \"Not-in-family\", \"White\", \"Male\", 0, 0, 30, \"United-States\"]\n",
"[29, \"?\", 227026, \"HS-grad\", 9, \"Never-married\", \"?\", \"Unmarried\", \"Black\", \"Male\", 0, 0, 40, \"United-States\"]\n",
"[63, \"Self-emp-not-inc\", 104626, \"Prof-school\", 15, \"Married-civ-spouse\", \"Prof-specialty\", \"Husband\", \"White\", \"Male\", 3103, 0, 32, \"United-States\"]\n",
"[24, \"Private\", 369667, \"Some-college\", 10, \"Never-married\", \"Other-service\", \"Unmarried\", \"White\", \"Female\", 0, 0, 40, \"United-States\"]\n",
"[55, \"Private\", 104996, \"7th-8th\", 4, \"Married-civ-spouse\", \"Craft-repair\", \"Husband\", \"White\", \"Male\", 0, 0, 10, \"United-States\"]"
]
},
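{
"cell_type": "markdown",
"metadata": {},
"source": [
"Optionally, check that each line of `input.json` parses as JSON and contains exactly 14 features, matching what the model expects:"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"import json\n",
"\n",
"# Parse each line and confirm it has the 14 features the model expects\n",
"with open(INPUT_FILE, \"r\") as f:\n",
"    for line in f:\n",
"        instance = json.loads(line)\n",
"        assert len(instance) == 14, \"Each instance must have 14 features\"\n",
"print(\"All instances are valid\")"
]
},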
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Send the online prediction request\n",
"\n",
"The prediction results return `True` if the person's income is predicted to be greater than $50,000 per year, and `False` otherwise. The output of the command below may appear similar to the following:\n",
"\n",
" [False, False, False, True, False, False, False, False, False, False]"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"scrolled": true
},
"outputs": [],
"source": [
"!gcloud ai-platform predict --model $MODEL_NAME --version \\\n",
" $VERSION_NAME --json-instances $INPUT_FILE"
]
},
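{
"cell_type": "markdown",
"metadata": {},
"source": [
"You can also request predictions programmatically instead of through `gcloud`. The following sketch uses the Google API Python client to call the same online prediction service; it assumes the `google-api-python-client` library is installed and that you replace the placeholder `PROJECT_ID` with your own project ID:"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"import json\n",
"\n",
"from googleapiclient import discovery\n",
"\n",
"PROJECT_ID = \"your-project-id\"  # Replace with your project ID\n",
"\n",
"# Read the instances from the input file created above\n",
"with open(INPUT_FILE, \"r\") as f:\n",
"    instances = [json.loads(line) for line in f]\n",
"\n",
"# Build a client for the AI Platform (ml, v1) API and call predict\n",
"service = discovery.build(\"ml\", \"v1\")\n",
"name = \"projects/{}/models/{}/versions/{}\".format(\n",
"    PROJECT_ID, MODEL_NAME, VERSION_NAME\n",
")\n",
"response = (\n",
"    service.projects()\n",
"    .predict(name=name, body={\"instances\": instances})\n",
"    .execute()\n",
")\n",
"\n",
"if \"error\" in response:\n",
"    raise RuntimeError(response[\"error\"])\n",
"print(response[\"predictions\"])"
]
},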
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# Clean up"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"To delete all resources you created in this tutorial, run the following commands:"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"scrolled": true
},
"outputs": [],
"source": [
"# Delete the model version\n",
"!gcloud ai-platform versions delete $VERSION_NAME --model=$MODEL_NAME --quiet\n",
"\n",
"# Delete the model\n",
"!gcloud ai-platform models delete $MODEL_NAME --quiet\n",
"\n",
"# Delete the bucket and contents\n",
"!gsutil rm -r gs://$BUCKET_NAME\n",
"\n",
"# Delete the local files created by the tutorial\n",
"!rm -rf census_training"
]
}
],
"metadata": {
"kernelspec": {
"display_name": "Python 3",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.6.4"
}
},
"nbformat": 4,
"nbformat_minor": 2
}